{
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "# Phoenix Evals Quickstart\n",
                "\n",
                "This quickstart shows how Phoenix helps you evaluate data from your LLM application (e.g., inputs, outputs, retrieved documents).\n",
                "\n",
                "You will:\n",
                "\n",
                "- Export a dataframe from your Phoenix session that contains traces from an instrumented LLM application,\n",
                "- Evaluate your trace data for:\n",
                "  - Relevance: Are the retrieved documents grounded in the response?\n",
                "  - Q&A correctness: Are your application's responses grounded in the retrieved context?\n",
                "  - Hallucinations: Is your application making up false information?\n",
                "- Ingest the evaluations into Phoenix to see the results annotated on the corresponding spans and traces.\n",
                "\n",
                "Let's get started!\n",
                "\n",
                "First, install Phoenix with `pip install arize-phoenix`.\n",
                "\n",
                "To get you up and running quickly, we'll download some pre-existing trace data collected from a LlamaIndex application (in practice, this data would be collected by instrumenting your LLM application with an OpenInference-compatible tracer).  # TODO: Add link"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from urllib.request import urlopen\n",
                "\n",
                "from phoenix.trace.trace_dataset import TraceDataset\n",
                "from phoenix.trace.utils import json_lines_to_df\n",
                "\n",
                "traces_url = \"https://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/trace.jsonl\"\n",
                "with urlopen(traces_url) as response:\n",
                "    lines = [line.decode(\"utf-8\") for line in response.readlines()]\n",
                "trace_ds = TraceDataset(json_lines_to_df(lines))"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Launch Phoenix. You can open use Phoenix within your notebook or in a separate browser window by opening the URL."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import phoenix as px\n",
                "\n",
                "(session := px.launch_app(trace=trace_ds)).view()\n",
                "session.view()"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "You should now see a view like this.\n",
                "\n",
                "![A view of the Phoenix UI prior to adding evaluation annotations](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/evals/traces_without_evaluation_annotations.png)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Export your retrieved documents and query data from your session into a pandas dataframe.\n",
                "\n",
                "Note: If you are interested in a different subset of your data, you can export with a custom query.  # TODO: Add link"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents\n",
                "\n",
                "retrieved_documents_df = get_retrieved_documents(px.Client())\n",
                "queries_df = get_qa_with_reference(px.Client())"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Phoenix evaluates your application data by prompting an LLM to classify whether a retrieved document is relevant or irrelevant to the corresponding query, whether a response is grounded in a retrieved document, etc. You can even get explanations generated by the LLM to help you understand the results of your evaluations!\n",
                "\n",
                "This quickstart uses OpenAI and requires an OpenAI API key, but we support a wide variety of APIs and models.  # TODO: Add link\n",
                "\n",
                "Install the OpenAI SDK with `pip install openai` and instantiate your model."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from phoenix.evals import OpenAIModel\n",
                "\n",
                "api_key = None  # set your api key here or with the OPENAI_API_KEY environment variable\n",
                "eval_model = OpenAIModel(model_name=\"gpt-4-turbo-preview\", api_key=api_key)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "You'll next define your evaluators. Evaluators are built on top of language models and prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use to perform evaluations using our battle-tested evaluation templates.\n",
                "\n",
                "![A diagram depicting how evaluators are composed of LLMs and evaluation prompt templates and product labels, scores, and explanations from input data (e.g., queries, references, outputs, etc.)](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/evals/evaluators_diagram.png)"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "from phoenix.evals import (\n",
                "    HallucinationEvaluator,\n",
                "    QAEvaluator,\n",
                "    RelevanceEvaluator,\n",
                ")\n",
                "\n",
                "hallucination_evaluator = HallucinationEvaluator(eval_model)\n",
                "qa_correctness_evaluator = QAEvaluator(eval_model)\n",
                "relevance_evaluator = RelevanceEvaluator(eval_model)"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Run your evaluations."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "import nest_asyncio\n",
                "\n",
                "from phoenix.evals import (\n",
                "    run_evals,\n",
                ")\n",
                "\n",
                "nest_asyncio.apply()  # needed for concurrency in notebook environments\n",
                "\n",
                "hallucination_eval_df, qa_correctness_eval_df = run_evals(\n",
                "    dataframe=queries_df,\n",
                "    evaluators=[hallucination_evaluator, qa_correctness_evaluator],\n",
                "    provide_explanation=True,\n",
                ")\n",
                "relevance_eval_df = run_evals(\n",
                "    dataframe=retrieved_documents_df,\n",
                "    evaluators=[relevance_evaluator],\n",
                "    provide_explanation=True,\n",
                ")[0]"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Log your evaluations to your running Phoenix session."
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "# Log span evaluations\n",
                "from phoenix.client import AsyncClient\n",
                "\n",
                "px_client = AsyncClient()\n",
                "\n",
                "await px_client.spans.log_span_annotations_dataframe(\n",
                "    dataframe=hallucination_eval_df,\n",
                "    annotation_name=\"Hallucination\",\n",
                "    annotator_kind=\"LLM\",\n",
                ")\n",
                "await px_client.spans.log_span_annotations_dataframe(\n",
                "    dataframe=qa_correctness_eval_df,\n",
                "    annotation_name=\"QA Correctness\",\n",
                "    annotator_kind=\"LLM\",\n",
                ")\n",
                "\n",
                "# Log document evaluations\n",
                "import phoenix as px\n",
                "from phoenix.trace import DocumentEvaluations\n",
                "\n",
                "px.Client().log_evaluations(\n",
                "    DocumentEvaluations(eval_name=\"Relevance\", dataframe=relevance_eval_df),\n",
                ")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "Your evaluations should now appear as annotations on your spans in Phoenix!"
            ]
        },
        {
            "cell_type": "code",
            "execution_count": null,
            "metadata": {},
            "outputs": [],
            "source": [
                "print(f\"🔥🐦 Open back up Phoenix in case you closed it: {session.url}\")"
            ]
        },
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": [
                "You can view aggregate evaluation statistics, surface problematic spans, understand the LLM's reason for each evaluation by reading the corresponding explanation, and pinpoint the cause (irrelevant retrievals, incorrect parameterization of your LLM, etc.) of your LLM application's poor responses.\n",
                "\n",
                "![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)"
            ]
        }
    ],
    "metadata": {
        "kernelspec": {
            "display_name": "llmapps",
            "language": "python",
            "name": "python3"
        },
        "language_info": {
            "codemirror_mode": {
                "name": "ipython",
                "version": 3
            },
            "file_extension": ".py",
            "mimetype": "text/x-python",
            "name": "python",
            "nbconvert_exporter": "python",
            "pygments_lexer": "ipython3",
            "version": "3.10.12"
        }
    },
    "nbformat": 4,
    "nbformat_minor": 2
}
